Efficient Bayesian Methods for Clustering

Author

  • Katherine Ann Heller
Abstract

One of the most important goals of unsupervised learning is to discover meaningful clusters in data. Clustering algorithms strive to discover groups, or clusters, of data points which belong together because they are in some way similar. The research presented in this thesis focuses on using Bayesian statistical techniques to cluster data. We take a model-based Bayesian approach to defining a cluster, and evaluate cluster membership in this paradigm. Because large data sets are increasingly common in practice, our aim is for the methods in this thesis to be efficient while still retaining the desirable properties which result from a Bayesian paradigm.

We develop a Bayesian Hierarchical Clustering (BHC) algorithm which efficiently addresses many of the drawbacks of traditional hierarchical clustering algorithms. The goal of BHC is to construct a hierarchical representation of the data, incorporating both finer- and coarser-grained clusters, in such a way that we can also make predictions about new data points, compare different hierarchies in a principled manner, and automatically discover interesting levels of the hierarchy to examine. BHC can also be viewed as a fast way of performing approximate inference in a Dirichlet Process Mixture model (DPM), one of the cornerstones of nonparametric Bayesian statistics.

We create a new framework for retrieving desired information from large data collections, Bayesian Sets, using Bayesian clustering techniques. Unlike current retrieval methods, Bayesian Sets provides a principled framework which leverages the rich and subtle information provided by queries in the form of a set of examples. Whereas most clustering algorithms are completely unsupervised, here the query provides supervised hints or constraints as to the membership of a particular cluster. We call this "clustering on demand", since it involves forming a cluster once some elements of that cluster have been revealed. We use Bayesian Sets to develop a content-based image retrieval system. We also extend Bayesian Sets to a discriminative setting and use this to perform automated analogical reasoning.

Lastly, we develop extensions of clustering in order to model data with more complex structure than that for which traditional clustering is intended. Clustering models traditionally assume that each data point belongs to one and only one cluster, and although they have proven to be a very powerful class of models, this basic assumption is somewhat limiting. For example, there may be overlapping regions where data points actually belong to multiple clusters, like movies which can each belong to multiple genres. We extend traditional mixture models to create a statistical model for overlapping clustering, the Infinite Overlapping Mixture Model (IOMM), in a nonparametric Bayesian setting, using the Indian Buffet Process (IBP). We also develop a Bayesian Partial Membership model (BPM), which allows data points to have partial membership in multiple clusters via a continuous relaxation of a finite mixture model.
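
The "clustering on demand" idea behind Bayesian Sets can be illustrated with a small sketch. The abstract does not specify a model or a scoring rule, so everything here is an assumption made for illustration: items are binary feature vectors, each feature has an independent Beta-Bernoulli model, and a candidate is scored by the ratio of its predictive probability given the query set to its prior predictive probability. The function names and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import math

def bayesian_sets_log_score(query, item, alpha=1.0, beta=1.0):
    """Log of p(item | query) / p(item), assuming independent
    Beta-Bernoulli features (an illustrative assumption).

    query: list of binary feature vectors (the example set)
    item:  one binary feature vector to score
    alpha, beta: hypothetical Beta prior hyperparameters shared by all features
    """
    n = len(query)
    log_score = 0.0
    for j, x_j in enumerate(item):
        ones = sum(row[j] for row in query)      # how often feature j is on in the query
        alpha_post = alpha + ones                # posterior Beta parameters
        beta_post = beta + n - ones
        p_post = alpha_post / (alpha_post + beta_post)   # P(x_j = 1 | query)
        p_prior = alpha / (alpha + beta)                 # P(x_j = 1)
        if x_j:
            log_score += math.log(p_post) - math.log(p_prior)
        else:
            log_score += math.log(1.0 - p_post) - math.log(1.0 - p_prior)
    return log_score

def rank_items(query, items, alpha=1.0, beta=1.0):
    """Return item indices ordered from best to worst match to the query set."""
    scored = [(bayesian_sets_log_score(query, it, alpha, beta), idx)
              for idx, it in enumerate(items)]
    return [idx for _, idx in sorted(scored, reverse=True)]

# Toy data: two query examples that share the first two features.
query = [[1, 1, 0], [1, 1, 0]]
items = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
ranking = rank_items(query, items)
```

In this toy example the item that matches the query on every feature is ranked first and the item that disagrees on every feature is ranked last. Scoring by a ratio rather than by the predictive probability alone means features that are common across the whole collection contribute little, which is one plausible reason to prefer it in a retrieval setting.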

Similar resources

Efficient Bayesian Clustering for Reinforcement Learning

A fundamental artificial intelligence challenge is how to design agents that intelligently trade off exploration and exploitation while quickly learning about an unknown environment. However, in order to learn quickly, we must somehow generalize experience across states. One promising approach is to use Bayesian methods to simultaneously cluster dynamics and control exploration; unfortunately, ...

An Efficient Bayesian Optimal Design for Logistic Model

Consider a Bayesian optimal design with many support points, which poses the problem of collecting data with a small number of observations at each design point. Under such a scenario the asymptotic property of using the Fisher information matrix for approximating the covariance matrix of posterior ML estimators might be doubtful. We suggest using the Bhattacharyya matrix in deriving the information matri...

Uncertainty Modeling of a Group Tourism Recommendation System Based on Pearson Similarity Criteria, Bayesian Network and Self-Organizing Map Clustering Algorithm

Group tourism is one of the most important tasks in tourist recommender systems. These systems, despite the potential contradictions among the group's tastes, seek to provide joint suggestions to all members of the group, and propose recommendations that would allow the satisfaction of a group of users rather than individual user satisfaction. Another issue that has received less attention i...

E-Bayesian Estimations of Reliability and Hazard Rate based on Generalized Inverted Exponential Distribution and Type II Censoring

Introduction: This paper is concerned with using Maximum Likelihood, Bayes, and a new method, E-Bayesian estimation, to compute estimates of the unknown parameter, reliability and hazard rate functions of the Generalized Inverted Exponential distribution. The estimates are derived based on a conjugate prior for the unknown parameter. E-Bayesian estimations are obtained based on th...

An Introduction to Inference and Learning in Bayesian Networks

Bayesian networks (BNs) are modern tools for modeling phenomena in dynamic and static systems and are used in different subjects such as disease diagnosis, weather forecasting, decision making and clustering. A BN is a graphical-probabilistic model which represents causal relations among random variables and consists of a directed acyclic graph and a set of conditional probabilities. Structure...

Publication date: 2008